README:

The use of the RStudio IDE and an R version of at least 3.5.2 is advised, due to the installation process. The script will automatically install all required packages if they are not already present. The relative positioning of all files and folders, including output files, should not be tampered with; work on copies instead. For more information or any questions concerning the given resources, please consider gathering additional insight or making contact through https://jorgeserras.com/


About

The current study aims to provide further research and resources on the detection and modeling of anomalous behavior in multivariate temporal data. A complete system, known as MultivariatE Time sEries OutlieR (METEOR) (Serras 2018), is built to handle multivariate time series (MTS) from pre-processing to score-analysis.

All data, including time stamps, are first pre-processed and discretized to ensure the correct application of the proposed methods. The developed system is based on Dynamic Bayesian Networks (DBNs). METEOR is made available for anomaly detection and is followed step-by-step. A web application and tutorial are additionally made accessible (Serras 2018) for any endeavor, together with supplementary information.

Furthermore, generated output files are available in the folder /resources/output_files, where several subsets and transformations of the input dataset are present. Additionally, the subfolder /DBN holds the output files concerning outlier detection using the DBN (METEOR) approach.

Abstract

Outliers can be defined as observations suspected of not having been generated by the data’s underlying processes. Many applications require a way of identifying interesting or unusual patterns in multivariate time series (MTS); however, most outlier detection methods focus solely on univariate series, providing analysts with strenuous solutions. We propose a complete outlier detection system, covering the problem from pre-processing onward, that adopts a dynamic Bayesian network modeling algorithm. The latter encodes optimal inter- and intra-time-slice connectivity of transition networks capable of capturing conditional dependencies in MTS datasets. A sliding window mechanism is employed to gradually score each MTS transition given the model. Simultaneously, complete MTS are evaluated. Two score-analysis strategies are studied to ensure an automatic boundary classifying anomalous data. The proposed approach is first validated on simulated data, demonstrating the system’s performance. A comparison with an assembled probabilistic suffix tree method is provided, displaying the leverage of the proposed multivariate approach. Further experimentation is performed on real data, uncovering anomalies in distinct scenarios such as electrocardiogram series, mortality rate data and handwritten pen digits. The developed system proved beneficial in capturing unusual data resulting from temporal contexts, being suitable for any MTS scenario. A widely accessible web application employing the complete system is made available jointly with a tutorial.


Formatting

To begin, the input data is loaded. If the dataset is not in the correct format, please check the examples and tools made available to ensure correct usage. The continuous MTS input dataset should be in the /resources folder. The code below must be changed accordingly and the number of variables n_variables specified. If the data is already discretized, skip the SAX Discretization phase.

####
original_continuous_data <- as.data.frame(read.csv(file = paste(read_path, "/continuous_example_data.csv", sep = ""), header=TRUE))
n_variables <- 2 # For the SAX plot function, 1 means the subjects are univariate

SAX Discretization

In the first phase, a method known as Symbolic Aggregate approXimation (SAX) (Lin et al. 2003) is applied to discretize time series (TS) into strings. It is capable of transforming a univariate TS of length n, after normalization, into a string of arbitrary length w. The latter length can be w << n if Piecewise Aggregate Approximation (PAA) is applied. This ensures dimensionality reduction of the normalized TS when excessive measurements are present. An alphabet size must be chosen to determine the number of symbols for the final discretized sequences. The method assumes that normalized TS follow a Gaussian distribution with zero mean and unit standard deviation. For additional information, please see (Lin et al. 2003).
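The three SAX steps above (normalize, PAA-average, map to symbols) can be sketched as follows. This is a minimal illustration, not the package's discretize() function; the helper name sax_discretize is hypothetical.

```r
# Minimal SAX sketch (hypothetical helper, NOT the package's discretize()):
# 1) z-normalize the TS, 2) PAA-average it into w segments,
# 3) map each segment mean to a symbol via N(0,1) equiprobable breakpoints.
sax_discretize <- function(ts, w, alphabet_size) {
  z <- (ts - mean(ts)) / sd(ts)                       # z-normalization
  segs <- split(z, cut(seq_along(z), w, labels = FALSE))
  paa <- vapply(segs, mean, numeric(1))               # PAA reduction to length w
  breaks <- qnorm(seq_len(alphabet_size - 1) / alphabet_size)  # equiprobable cuts
  unname(letters[findInterval(paa, breaks) + 1])      # symbols "a", "b", ...
}

sax_discretize(1:6, w = 3, alphabet_size = 3)
# returns c("a", "b", "c") for a monotonically increasing series
```

A monotonically increasing series maps to increasing symbols, as expected; reducing w below the series length is what triggers the PAA dimensionality reduction described above.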

SAX does not allow missing values; imputation of missing values should be considered as future work. The SAX discretization is the same as the one available in the web app: https://jorgeserras.shinyapps.io/outlierdetection/

Parameters can be changed in the code below, allowing the adjustment of the alphabet size, the PAA reduction size and an auxiliary function for the plot. Feel free to experiment until results are satisfactory.

ATTENTION: It is worth noting that since each TS is normalized, the offset is lost in the process. This means that subjects with extreme values, for example, may be brought to zero and later be classified as normal. Visual inspection, or the addition of a step before SAX, is advised to prevent such scenarios.

### Discretization:
panel_data <- as.data.frame(parseToPanel(original_continuous_data))[,1:(n_variables+2)]

### SAX parameters:
alphabet_size <- 3
paa_size <- as.integer(max(panel_data$timestamp)) # paa_size = ncol() - 1 prevents dimensionality reduction; the -1 accounts for the "subject_id" column. This holds for the univariate case; if the data is multivariate, paa_size must equal the number of time slices to avoid performing PAA reduction

### SAX parameters: alphabet_size and paa_size, which gives the new reduced number of time slices for each TS
discrete_outputs <- discretize(panel_data = panel_data, alphabet_size = alphabet_size, paa_size = paa_size)
### Retrieve solely the discretized dataset:
discrete_data <- discrete_outputs[[1]] # The variable names differ, but they are easily recovered later if necessary
normalized_data <- discrete_outputs[[2]]
timestamp <- discrete_outputs[[3]]
subject_id_data <- data.frame(subject_id = discrete_data$subject_id)
### Save in file:
write.csv(discrete_data, file = paste(write_path, "/discrete_data.csv", sep = ""), row.names = FALSE, quote = TRUE)

The results of the discretization can be analyzed in the figure below, which displays the normalized data before performing PAA reduction (if considered). It shows the mean and standard deviation of each time slice of the dataset. Every TS displayed is normalized with mean = 0 and std = 1. If the number of variables is higher than 1, several curves occur, one for each variable of the MTS. The breakpoints are also visible (pink), delimiting the regions of each symbol of the chosen alphabet. An alphabet size of 3, for example, displays 2 breakpoints separating the entire y domain into 3 equiprobable regions according to a Normal distribution N(0,1).
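The equiprobable breakpoints shown in pink can be reproduced directly from the standard normal quantiles; a small sketch (not part of the project code):

```r
# Breakpoints for an alphabet of size 3: two cuts splitting N(0,1)
# into three equiprobable regions
alphabet_size <- 3
breakpoints <- qnorm(seq_len(alphabet_size - 1) / alphabet_size)
round(breakpoints, 4)
# -0.4307  0.4307
```

By construction the breakpoints are symmetric around zero, matching the y-domain split described above.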

Results of SAX pre-processing including PAA reduction (if considered). Parameters can be changed below to specify which subjects to observe.

kable(data.frame(discrete_data[1:7,]),caption="SAX discretized subjects")
SAX discretized subjects
subject_id V1__1 V2__1 V1__2 V2__2 V1__3 V2__3 V1__4 V2__4 V1__5 V2__5 V1__6 V2__6 V1__7 V2__7 V1__8 V2__8 V1__9 V2__9 V1__10 V2__10 V1__11 V2__11 V1__12 V2__12 V1__13 V2__13 V1__14 V2__14 V1__15 V2__15 V1__16 V2__16 V1__17 V2__17 V1__18 V2__18 V1__19 V2__19 V1__20 V2__20 V1__21 V2__21 V1__22 V2__22 V1__23 V2__23 V1__24 V2__24 V1__25 V2__25 V1__26 V2__26 V1__27 V2__27 V1__28 V2__28 V1__29 V2__29 V1__30 V2__30 V1__31 V2__31 V1__32 V2__32 V1__33 V2__33 V1__34 V2__34 V1__35 V2__35 V1__36 V2__36 V1__37 V2__37 V1__38 V2__38 V1__39 V2__39
0 c b c a c b c c c c c c b c c c b c c c b c b b b b b b b b b b a b b b a b a b a b a a a a a a a a a a a a a a a a a a a a b a b a b a b b b b c b c b c b
1 c b c a c a c b c c b c b c b c b c b c b c b c b b b b b b b b b b b b b b a b b b a b a b a b a a a a a a a a a a a a a a b a b a b a b a b a c a c b c b
2 c b c b c b c a c b c c c c b c b c b c b c b c b c b c a b b b a b a b a b a b a b a b a b a a a a a a a a a a a a a a a a b a b a b a c a c a c b c b c b
3 c b c b c a c a c b c c c c b c b c c c b c b c b c b b b b b b b b b b a b a b a b a b a b a a a a a a a a a a a a a a a a b a b a b a b a b a c b c b c b
4 c b c b c a c a c b c c c c c c b c c c b c b c b c b b b b b b b b a b a b a b a b a b a b a a a a a a a a a a a a a a a a a a b a b a b a b a b b b b b b
5 c b c b c a c a c c c b c c b c b c b c b c b c b c b b b b a b a b a b a b a b a b a b a a a a a a a a a a a a a a a a a a b a b a c a c a c a c b c b c b
6 c b c b c a c a c b c c c c c c b c c c b c b c b c b b b b b b a b a b a b a b a b a b a b a a a a a a a a a a a a a a a a b a b a b a c a c a c a c b c b

ATTENTION: If PAA reduction was performed, the transition plots displaying outliers in red are not correct, since they display the input data before the reduction. Nevertheless, all computations and conclusions are correct. Users should also remember that, having performed PAA, a transition t classified as anomalous does not correspond to the same transition of the input data.

METEOR application (DBN approach)

Many applications require a way of identifying interesting or unusual patterns in multivariate time series (MTS); however, most outlier detection methods focus solely on univariate series, providing analysts with strenuous solutions. We propose a complete outlier detection system known as METEOR (Serras 2018), covering the problem from pre-processing onward, that adopts a dynamic Bayesian network (DBN) modeling algorithm (Monteiro, Vinga, and Carvalho 2015). The latter encodes optimal inter- and intra-time-slice connectivity of transition networks capable of capturing conditional dependencies in MTS datasets. A sliding window mechanism is employed to gradually score each MTS transition given the model. Simultaneously, complete MTS are evaluated. The developed system proved beneficial in capturing unusual data resulting from temporal contexts, being suitable for any MTS scenario. A widely accessible web application (Serras 2018) employing the complete system is made available jointly with a tutorial.

Outliers are considered to be observations suspected of not having been generated by the data’s main underlying processes. Such instances generally do not fit the trained model. The goal is to score portions as well as entire MTS. For each transition of a subject’s MTS, a window composed of its observed attributes is built, dependent on the order of the trained DBN. A transition score represents the log-likelihood (LL) of the observed window, computed using the network’s conditional probabilities. It is worth noting that a transition score does not represent the outlierness of a single time slice, but rather of its context. The score of an entire TS (subject) is computed as the average of all the transition scores from that subject. The more negative a score, the more anomalous it is.

With parameters markov_lag = 1 and previous_parents = 1, for example, each measurement at time t is associated with its immediate predecessor (the value at slice t-1).
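The sliding-window scoring described above can be illustrated on a toy univariate series. This is a hand-rolled sketch, not the package internals: the conditional probability table is assumed, and with markov_lag = 1 each window is simply the pair (x_{t-1}, x_t).

```r
# Toy illustration of transition scoring with markov_lag = 1:
# each transition score is log P(x_t | x_{t-1}) from an assumed
# conditional probability table (rows = previous symbol, cols = current).
cpt <- matrix(c(0.9, 0.1,
                0.2, 0.8), nrow = 2, byrow = TRUE,
              dimnames = list(c("a", "b"), c("a", "b")))

series <- c("a", "a", "b", "b")
transition_scores <- sapply(2:length(series), function(t)
  log(cpt[series[t - 1], series[t]]))   # one LL score per window (x_{t-1}, x_t)
subject_score <- mean(transition_scores) # subject score = average transition score
round(subject_score, 4)
# -0.877
```

The rare transition a→b (probability 0.1) yields the most negative score, which is exactly the behavior the system exploits: anomalous contexts receive low log-likelihoods and drag the subject score down.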

Issues may occur in this phase due to the rJava package. This happens because the package is still under active development (bugs and features are still being worked on) and because of the Java version in use. Most of the time, simply rerunning the script fixes the issue; try rerunning until it works. If problems persist, consider reinstalling rJava and updating Java (uninstalling the previous version), or use the online version of the web application https://jorgeserras.shinyapps.io/outlierdetection/ instead of running the code. The Dynamic Bayesian Network Outlier Detection system for multivariate time series is used. Check the main webpages https://jorgeserras.shinyapps.io/outlierdetection/ and https://jorgeserras.com/MTS_OutlierDetection/ for additional information. Don’t forget there is a tutorial video available!

NOTE: If the SAX discretization phase was skipped, due to data already being discrete, please insert the discrete MTS dataset in format .csv in the folder “/resources/output_files” and change the name below from “/discrete_data.csv” to the name of your file. There is an example file in the folder named “/discrete_example_data.csv”.

### Both transitions and complete subjects are scored
### Name of the file where the discrete data is stored:
data_file_name <- "/discrete_data.csv"

DBN_output <- DBN_scoring(data_file_name, write_path, markov_lag = 1, previous_parents = 1, checkbox_stationary = TRUE, score_function = "ll")
DBN_transition_scores <- DBN_output[[1]]
DBN_subject_scores <- DBN_output[[2]]

# Join all the DBN scores (transitions and subjects) in one file/data.frame
DBN_scores <- DBN_transition_scores[,-1]
DBN_scores <- cbind(data.frame(subject_id=subject_id_data),DBN_scores,data.frame(subject_score=DBN_subject_scores$score))

# Restore the true subject_id values that are lost in the process
DBN_transition_scores$subject_id <- NULL
DBN_subject_scores$subject_id <- NULL
DBN_transition_scores <- cbind(data.frame(subject_id = subject_id_data, DBN_transition_scores))
DBN_subject_scores <- cbind(data.frame(subject_id = subject_id_data, DBN_subject_scores))

Visualization of the modeled DBN, which can be stationary or non-stationary. Nodes outlined in blue concern attributes at the present time slice (t) being scored. All other nodes concern attributes from previous slices (t-1, …) according to the structure of the transition networks. A node Vi[t] concerns the value of the i-th variable of the dataset at time slice t.

By default, the DBNs are modeled using a log-likelihood function; a parameter can be changed to use MDL instead. The latter allows a more constrained training, which usually reduces the number of connections in the transition networks.

grViz("dbn_structure_dot") # Plot the DBN using file 'dbn_structure_dot'

The scoring results for transitions and entire subjects can be observed below, with the subjects sorted by their subject scores (the lower the value, the more anomalous). The subjects to be displayed can be specified in the function’s argument. The results are written to the files DBN_scores_combined.csv and DBN_scores_combined_sort.csv, available in the folder ‘/resources/output_files/DBN’ together with all other DBN outputs.

kable(data.frame(DBN_sorted_scores[1:7,]),caption="DBN transition and subject scores")
DBN transition and subject scores
subject_id t_0 t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8 t_9 t_10 t_11 t_12 t_13 t_14 t_15 t_16 t_17 t_18 t_19 t_20 t_21 t_22 t_23 t_24 t_25 t_26 t_27 t_28 t_29 t_30 t_31 t_32 t_33 t_34 t_35 t_36 t_37 subject_score
193 192 -0.0409921 -0.0409921 -0.0409921 -0.0409921 -6.9097573 -8.006368 -1.568219 -8.0063676 -7.6642942 -8.0063676 -8.0063676 -3.2898548 -8.006368 -3.289855 -1.100614 -8.006368 -3.289855 -8.0063676 -9.098998 -1.100614 -1.100614 -0.0040040 -0.0040040 -0.0040040 -0.0040040 -0.0040040 -8.0063676 -1.910293 -0.8487380 -6.1742072 -0.8487380 -3.3489018 -6.1742072 -4.1533086 -0.7638126 -0.7638126 -6.1742072 -1.5682194 -3.562528
74 73 -0.0409921 -0.0409921 -0.0409921 -0.0409921 -11.9833502 -1.568219 -8.006368 -1.1006143 -1.1006143 -8.0063676 -3.2898548 -8.0063676 -3.289855 -1.100614 -1.100614 -6.909757 -1.100614 -0.0040040 -8.006368 -3.289855 -0.004004 -6.9097573 -1.1006143 -0.0040040 -6.9097573 -6.9097573 -6.9448128 -13.815511 -8.0063676 -1.9102928 -1.9102928 -0.8487380 -0.0443313 -3.3489018 -0.7638126 -0.7638126 -0.7638126 -0.7638126 -3.414468
199 198 -0.6426953 -0.0443313 -0.0443313 -6.9150290 -3.4444222 -8.006368 -6.909757 -0.8453989 -0.0409921 -0.0409921 -5.0775970 -0.8453989 -11.983350 -1.568219 -8.006368 -8.006368 -1.910293 -3.2898548 -8.006368 -3.289855 -1.100614 -8.0063676 -3.2898548 -1.1006143 -1.1006143 -1.1006143 -1.1006143 -6.909757 -1.1006143 -0.0040040 -0.0040040 -0.0040040 -0.0040040 -0.0040040 -8.0063676 -1.9102928 -0.8487380 -6.1742072 -3.175981
68 67 -1.4471021 -0.0443313 -0.0443313 -3.3489018 -0.7638126 -8.006368 -6.909757 -6.9414737 -0.0409921 -5.0775970 -0.8136825 -0.8453989 -11.983350 -1.910293 -1.910293 -1.910293 -1.910293 -3.2898548 -1.100614 -8.006368 -3.289855 -1.1006143 -1.1006143 -8.0063676 -3.2898548 -1.1006143 -1.1006143 -8.006368 -3.2898548 -1.1006143 -0.0040040 -0.0040040 -0.0040040 -0.0040040 -8.0063676 -1.9102928 -1.9102928 -0.8487380 -2.904794
167 166 -0.0409921 -0.0409921 -0.0409921 -0.0409921 -3.4444222 -8.006368 -7.664294 -6.1742072 -3.2898548 -8.0063676 -1.9102928 -3.2898548 -8.006368 -3.289855 -8.006368 -3.289855 -1.100614 -1.1006143 -1.100614 -1.100614 -1.100614 -6.9097573 -1.1006143 -0.0040040 -0.0040040 -0.0040040 -8.0063676 -2.193245 -8.0063676 -1.9102928 -1.9102928 -1.9102928 -0.8487380 -0.0443313 -0.0443313 -3.3489018 -0.7638126 -0.7638126 -2.837324
108 107 -0.0040040 -0.0040040 -0.0040040 -0.0040040 -6.9097573 -1.100614 -1.100614 -8.0063676 -1.9102928 -3.2898548 -8.0063676 -3.2898548 -8.006368 -3.289855 -8.006368 -9.098998 -6.909757 -0.8136825 -9.098998 -6.909757 -6.909757 -0.8136825 -0.8136825 -0.8136825 -0.8453989 -0.0409921 -0.0409921 -3.444422 -0.7638126 -0.7638126 -0.7638126 -0.7638126 -0.7638126 -0.7638126 -0.7638126 -0.7638126 -0.7638126 -0.7638126 -2.819059
94 93 -0.6426953 -0.0443313 -0.0443313 -3.3489018 -0.7638126 -8.006368 -6.909757 -0.8453989 -0.0409921 -5.0775970 -0.8453989 -5.0775970 -7.719436 -3.289855 -1.100614 -8.006368 -3.289855 -1.1006143 -8.006368 -3.289855 -1.100614 -1.1006143 -1.1006143 -1.1006143 -1.1006143 -1.1006143 -1.1006143 -1.100614 -6.9097573 -1.1006143 -0.0040040 -0.0040040 -8.0063676 -1.9102928 -1.9102928 -1.9102928 -0.8487380 -0.0443313 -2.602730

SCORE-ANALYSIS

A score-analysis phase is performed to disclose the boundary between the two classes, “Anomalous” and “Normal”. Two strategies are tested: Tukey’s strategy (Tukey 1977) and Gaussian Mixture Models (GMM) (McLachlan 2018) using 2 classes. The latter is performed using the package “mclust” (Fraley et al. 2017). More information is also available on the webpages of the developed web application. Both score-analysis approaches are applied to subject scores and to transition scores from the DBN approach.

Tukey’s strategy uses the Interquartile Range (IQR), which is the difference between the third and first quartiles of the scores’ distribution. This measures the dispersion of the values. The computed boundary is thus:

\[ Threshold = Q1 - (1.5 \cdot IQR), \quad IQR = Q3 - Q1,\] meaning that scores below the threshold are classified as anomalous. Tukey’s strategy assumes normal scores form a symmetric cluster, meaning outliers create a negative skew in the distribution.
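The boundary above is straightforward to compute with base R; a minimal sketch on a toy score vector (the scores are illustrative, not project output):

```r
# Tukey boundary sketch on a toy vector of subject scores
scores <- c(-1.2, -0.8, -1.0, -0.9, -1.1, -6.5)

q <- quantile(scores, c(0.25, 0.75))  # Q1 and Q3 of the score distribution
iqr <- q[2] - q[1]                    # IQR = Q3 - Q1
threshold <- q[1] - 1.5 * iqr         # scores below this are flagged as anomalous
which(scores < threshold)
# 6  (only the extreme subject is flagged)
```

Note this flags only the negative tail, consistent with the text: only scores below the boundary are anomalous, while unusually high (less negative) scores are left alone.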

In the GMM score-analysis, disjoint score distributions are tackled. Scores are assumed to be generated from a finite mixture of Gaussian distributions with unknown parameters. Score distributions are modeled as mixtures of two Gaussian curves. Labeling each score becomes a classification problem among two classes C1 and C2, representing abnormality and normality respectively. The problem is defined as uncovering the value of

\[P(C_i|s) = \displaystyle \frac{P(s|C_i) \cdot P(C_i)}{P(s)}, \quad i \in \{1,2\}\] for each score value s, which can be obtained by Bayes’ rule. The threshold is defined as the boundary that best separates both curves, which describes the point of maximum uncertainty. This means that scores falling below the threshold have a higher likelihood of belonging to the abnormal class C1, and are classified as outliers.
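The Bayes' rule computation above can be sketched for a two-component mixture. The component means, standard deviations and mixing weights below are illustrative assumptions, not fitted values (in the actual system they come from mclust):

```r
# Posterior probability of the abnormal class C1 by Bayes' rule,
# with ASSUMED (not fitted) mixture parameters: mu, sigma, weights w
p_abnormal <- function(s, mu = c(-6, -1), sigma = c(1, 0.5), w = c(0.1, 0.9)) {
  joint <- c(dnorm(s, mu[1], sigma[1]) * w[1],  # P(s|C1)·P(C1): abnormal component
             dnorm(s, mu[2], sigma[2]) * w[2])  # P(s|C2)·P(C2): normal component
  joint[1] / sum(joint)                         # P(C1|s) = P(s|C1)P(C1) / P(s)
}

p_abnormal(-6)  # close to 1: very likely abnormal
p_abnormal(-1)  # close to 0: very likely normal
```

The threshold of maximum uncertainty is the score s where p_abnormal(s) = 0.5, i.e. where the two weighted Gaussian curves intersect.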


METEOR (DBN)

Perform both score-analysis strategies (Tukey and GMM) for both subject and transition scores from the modeled DBNs.

Both thresholds from the score-analysis methods are now used to classify each score.

Transitions

First, each transition score is evaluated with both score-analysis strategies using the corresponding thresholds. These are highlighted in red and observed in the next plots.

Recall that data marked as anomalous refers to the whole transition, not solely to the values of that specific slice. An entire anomalous transition is drawn in red, ending at the last time slice generated by that transition. Each anomalous transition contains lag+1 time slices, due to the size of the window. In the present case, since the Markov lag is 1, an anomalous transition is represented by a red point at t (the first instant of the transition) and a red line connecting this first point t and t+1, the last slice of the transition. This last point is not marked in red, since that would indicate an outlier for the next transition (beginning at t+1) instead of the one which generated it.


For the results, please check the folder /resources/output_files/DBN to retrieve the outputs of each phase. It is advised to open the .csv files as text files, since some editors may change the appearance of the content.

ATTENTION: If PAA reduction was performed, the transition plots displaying outliers in red are not correct, since they display the input data before the reduction. Nevertheless, all computations and conclusions are correct. Users should also remember that, having performed PAA, a transition t classified as anomalous does not correspond to the same transition of the input data.


Tukey

Transitions classified as anomalous using Tukey’s strategy on data modeled by the DBN.

Tukey’s threshold: -4.7142438

Results of the final DBN transition classification. Parameters can be changed below to specify which subjects to observe.

kable(data.frame(results.DBN.transition.Tukey[1:7,]),caption="DBN final results using Tukey's threshold")
DBN final results using Tukey’s threshold
subject_id t_0 t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8 t_9 t_10 t_11 t_12 t_13 t_14 t_15 t_16 t_17 t_18 t_19 t_20 t_21 t_22 t_23 t_24 t_25 t_26 t_27 t_28 t_29 t_30 t_31 t_32 t_33 t_34 t_35 t_36 t_37
0 0 0 1 0 0 1 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
2 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
3 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
4 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
5 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
6 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

Score distribution and the final classification of each transition according to Tukey’s strategy.


GMM

Transitions classified as anomalous using the GMM strategy on data modeled by the DBN.

GMM threshold: -2.7415497

Results of the final DBN transition classification. Parameters can be changed below to specify which subjects to observe.

kable(data.frame(results.DBN.transition.GMM[1:7,]),caption="DBN final results using GMM's threshold")
DBN final results using GMM’s threshold
subject_id t_0 t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8 t_9 t_10 t_11 t_12 t_13 t_14 t_15 t_16 t_17 t_18 t_19 t_20 t_21 t_22 t_23 t_24 t_25 t_26 t_27 t_28 t_29 t_30 t_31 t_32 t_33 t_34 t_35 t_36 t_37
0 0 1 1 0 0 1 0 1 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
2 0 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
3 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
4 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
5 0 0 0 1 1 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
6 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0

Score distribution and the final classification of each transition according to the GMM strategy.

Subjects possessing transitions classified as anomalous by METEOR with GMM’s score-analysis:

##   [1] "0"   "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10" 
##  [12] "11"  "12"  "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21" 
##  [23] "22"  "23"  "24"  "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32" 
##  [34] "33"  "34"  "35"  "36"  "37"  "38"  "39"  "40"  "41"  "42"  "43" 
##  [45] "44"  "45"  "46"  "47"  "48"  "49"  "50"  "51"  "52"  "53"  "54" 
##  [56] "55"  "56"  "57"  "58"  "59"  "60"  "61"  "62"  "63"  "64"  "65" 
##  [67] "66"  "67"  "68"  "69"  "70"  "71"  "72"  "73"  "74"  "75"  "76" 
##  [78] "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84"  "85"  "86"  "87" 
##  [89] "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96"  "97"  "98" 
## [100] "99"  "100" "101" "102" "103" "104" "105" "106" "107" "108" "109"
## [111] "110" "111" "112" "113" "114" "115" "116" "117" "118" "119" "120"
## [122] "121" "122" "123" "124" "125" "126" "127" "128" "129" "130" "131"
## [133] "132" "133" "134" "135" "136" "137" "138" "139" "140" "141" "142"
## [144] "143" "144" "145" "146" "147" "148" "149" "150" "151" "152" "153"
## [155] "154" "155" "156" "157" "158" "159" "160" "161" "162" "163" "164"
## [166] "165" "166" "167" "168" "169" "170" "171" "172" "173" "174" "175"
## [177] "176" "177" "178" "179" "180" "181" "182" "183" "184" "185" "186"
## [188] "187" "188" "189" "190" "191" "192" "193" "194" "195" "196" "197"
## [199] "198" "199"

Subjects

Each subject score is evaluated with both score-analysis strategies using the corresponding thresholds.

The user can change the next code to specify which subjects to plot and which variable, in order to compare a single-variable TS between subjects. This allows a comparison with the next figures, which only plot anomalous subjects. This way, different subjects can be analysed as required.

Subjects classified as anomalous for both strategies are plotted next, for data modeled by METEOR.


Tukey

Plot of the anomalous results using Tukey’s score-analysis; some plots can be altered by changing the input of the corresponding functions. The plots only display a single variable, as showing all of them would make it very hard to distinguish subjects.

Tukey’s threshold: -2.6931963

Subjects classified as anomalous by METEOR with Tukey’s score-analysis:

## [1] "67"  "73"  "107" "166" "192" "198"

Results of the final DBN subject classification. Parameters can be changed below to specify which subjects to observe.

kable(data.frame(results.DBN.subject.Tukey[1:7,]),caption="DBN final results using Tukey's threshold")
DBN final results using Tukey’s threshold
subject_id outlier
0 0
1 0
2 0
3 0
4 0
5 0
6 0

Score distribution and the final classification of each subject according to Tukey’s strategy.

***

GMM

Plot of the anomalous results using GMM score-analysis, some plots can be altered by changing the input of the corresponding functions.

GMM threshold: -2.7108949

Subjects classified as anomalous by the DBN approach with the GMM score-analysis:

## [1] "67"  "73"  "107" "166" "192" "198"

Results of the final DBN subject classification. Parameters can be changed below to specify which subjects to observe.

kable(data.frame(results.DBN.subject.GMM[1:7,]),caption="DBN final results using GMM's threshold")
DBN final results using GMM’s threshold
subject_id outlier
0 0
1 0
2 0
3 0
4 0
5 0
6 0

Score distribution and the final classification of each subject according to the GMM strategy.


Subjects which are classified as anomalous and which, at the same time, possess anomalous transitions according to GMM’s score-analysis are presented below.

## [1] "67"  "73"  "107" "166" "192" "198"

Conclusions and Future Work

Uncovering anomalies among MTS measurements is far from an easy task, not only due to the high instability/variability of values within each specific application, but also due to external/unknown factors which could have caused the anomalies. Most detections are usually caused by abrupt changes or abnormal values in multiple consecutive samples. It is worth noting that outlier subjects which also possess anomalous transitions are more likely to be interesting.

There is room for improvement and future research, by enhancing the existing methods and extending the present conclusions to other datasets, particularly those with missing data.


References

Fraley, Chris, Adrian Raftery, Luca Scrucca, Thomas Brendan Murphy, Michael Fop, and Maintainer Luca Scrucca. 2017. “Package Mclust.”

Lin, Jessica, Eamonn J. Keogh, Stefano Lonardi, and Bill Yuan-chi Chiu. 2003. “A Symbolic Representation of Time Series, with Implications for Streaming Algorithms.” In DMKD, 2–11. ACM.

McLachlan, Geoff. 2018. “Finite Mixture Models.” Annual Review of Statistics and Its Application 5 (1). Annual Reviews.

Monteiro, Jose L, Susana Vinga, and Alexandra M Carvalho. 2015. “Polynomial-Time Algorithm for Learning Optimal Tree-Augmented Dynamic Bayesian Networks.” In UAI, 622–31.

Serras, Jorge Luis Fuzeta. 2018. “Dynamic Bayesian Outlier Detection.” https://jorgeserras.shinyapps.io/outlierdetection/.

Tukey, John W. 1977. Exploratory Data Analysis. Vol. 2. Reading, Mass.